Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to unrealistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to thoroughly exploit the contextual information provided by the audio and visual modalities with carefully designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. It uniformly attends to the textural information in the unmasked regions and the reference frame. The semantic audio information is then incorporated to enhance the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.
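The attention-based fusion described above can be sketched minimally: masked mouth-region tokens act as queries (with an audio feature added, a simplified stand-in for the paper's audio-enhanced attention) and attend over unmasked and reference tokens. The names and dimensions here are illustrative, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_aware_fusion(masked_tokens, context_tokens, audio_feat):
    """Fill masked-region tokens by attending over unmasked/reference
    context tokens, with an audio feature added to the queries."""
    q = masked_tokens + audio_feat                     # audio-enhanced queries
    scores = q @ context_tokens.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)                    # each row sums to 1
    return attn @ context_tokens                       # convex mix of context

rng = np.random.default_rng(0)
ctx = rng.normal(size=(10, 8))                         # unmasked + reference tokens
filled = audio_aware_fusion(rng.normal(size=(4, 8)), ctx, rng.normal(size=(8,)))
```

Because each attention row is a convex combination, every filled token stays within the per-dimension range of the context tokens.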
Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequences driven by speech audio. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representations into codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos. Demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE
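The VQ-Motion Extractor relies on vector quantization, i.e., mapping each continuous motion feature to its nearest codebook entry. A minimal sketch of that core operation (the function name and the toy codebook are illustrative, not the paper's):

```python
import numpy as np

def vq_quantize(motion_feats, codebook):
    """Map each continuous motion feature to its nearest codebook entry
    (Euclidean distance), returning indices and quantized vectors."""
    # (n, 1, d) - (1, k, d) -> (n, k) squared distances via broadcasting
    d2 = ((motion_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.1]])
idx, quant = vq_quantize(feats, codebook)   # each feature snaps to a code
```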
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects in images simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves recall rates of $91.04\%$ and $83.51\%$ on RefCOCO testA and testB, respectively. Code will be available at \url{https://github.com/IDEA-Research/DQ-DETR}.
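The dual-query design can be illustrated schematically: both queries of a pair share the same positional part but carry different content parts. A toy sketch (concatenation is used here only for illustration; the actual model composes queries inside a Transformer decoder):

```python
import numpy as np

def make_dual_queries(pos_embed, obj_content, phrase_content):
    """Build a pair of dual queries: both share one positional part but
    carry different content parts (object prediction vs. phrase-mask
    prediction). Returned as concatenated [position; content] vectors."""
    obj_q = np.concatenate([pos_embed, obj_content])
    phrase_q = np.concatenate([pos_embed, phrase_content])
    return obj_q, phrase_q

pos = np.ones(4)                                   # shared positional part
obj_q, phrase_q = make_dual_queries(pos, np.zeros(4), np.full(4, 2.0))
```

Sharing the positional part is what ties an object box and its phrase mask to the same spatial anchor, while the separate content parts let each query specialize to its own modality.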
We present a unified hard-constraint framework for solving geometrically complex PDEs with neural networks, where the most commonly used Dirichlet, Neumann, and Robin boundary conditions (BCs) are considered. Specifically, we first introduce the "extra fields" from the mixed finite element method to reformulate the PDEs so as to equivalently transform the three types of BCs into linear forms. Based on the reformulation, we derive the general solutions of the BCs analytically, which are employed to construct an ansatz that automatically satisfies the BCs. With such a framework, we can train the neural networks without adding extra loss terms and thus efficiently handle geometrically complex PDEs, alleviating the unbalanced competition between the loss terms corresponding to the BCs and PDEs. We theoretically demonstrate that the "extra fields" can stabilize the training process. Experimental results on real-world geometrically complex PDEs showcase the effectiveness of our method compared with state-of-the-art baselines.
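As a simple illustration of a BC-satisfying ansatz, consider a 1D Dirichlet case using the classical distance-function construction (this is the generic idea, not the paper's "extra fields" reformulation): with $u(0)=a$ and $u(1)=b$, the ansatz $u(x) = (1-x)a + xb + x(1-x)N(x)$ satisfies the BCs exactly for any network output $N(x)$, so no boundary loss term is needed.

```python
import numpy as np

def hard_constraint_ansatz(x, a, b, net):
    """1D Dirichlet example of a BC-satisfying ansatz on [0, 1]:
    u(0) = a and u(1) = b hold exactly for any network `net`,
    because the distance factor x * (1 - x) vanishes on the boundary."""
    return (1 - x) * a + x * b + x * (1 - x) * net(x)

net = lambda x: np.sin(3 * x)          # stand-in for a trained network
u0 = hard_constraint_ansatz(np.array(0.0), 2.0, -1.0, net)
u1 = hard_constraint_ansatz(np.array(1.0), 2.0, -1.0, net)
```

Only the interior PDE residual then enters the loss, which is what removes the unbalanced competition between BC and PDE loss terms.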
The task of human face swapping has received many attempts given its wide range of applications. Although existing methods mostly rely on tedious network and loss designs, they still struggle to balance the information between the source and target faces and tend to produce visible artifacts. In this work, we introduce a concise and effective framework named StyleSwap. Our core idea is to leverage a style-based generator to empower high-fidelity and robust face swapping, so that the generator's advantages can be adopted to optimize identity similarity. We identify that, with only minimal modifications, the StyleGAN2 architecture can successfully handle the desired information from both source and target. Additionally, inspired by the ToRGB layers, a swapping-driven mask branch is further devised to improve information blending. Furthermore, the advantages of StyleGAN inversion can be adopted. In particular, a swapping-guided ID inversion strategy is proposed to optimize identity similarity. Extensive experiments validate that our framework generates high-quality face swapping results, outperforming state-of-the-art methods both qualitatively and quantitatively.
Knowledge distillation is an effective way to transfer knowledge from a strong teacher to an efficient student model. Ideally, we expect that the better the teacher is, the better the student becomes. However, this expectation does not always come true. It is common that a better teacher model results in a worse student via distillation due to the non-negligible gap between teacher and student. To bridge the gap, we propose PROD, a progressive distillation method for dense retrieval. PROD consists of teacher progressive distillation and data progressive distillation to gradually improve the student. We conduct extensive experiments on five widely used benchmarks, MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document, and Natural Questions, where PROD achieves the state-of-the-art among distillation methods for dense retrieval. The code and models will be released.
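At the core of any such distillation pipeline is a teacher-student objective; a minimal sketch of the standard temperature-softened KL distillation loss follows (illustrative of distillation in general, not PROD's exact objective):

```python
import numpy as np

def softmax(z, t=1.0):
    e = np.exp((z - z.max()) / t)
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, t=2.0):
    """Temperature-softened KL divergence KL(teacher || student), the
    standard knowledge-distillation objective; a higher temperature t
    exposes more of the teacher's soft relative preferences."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return float(np.sum(p * np.log(p / q)))

loss_same = distill_kl(np.array([2.0, 0.5, -1.0]), np.array([2.0, 0.5, -1.0]))
loss_diff = distill_kl(np.array([2.0, 0.5, -1.0]), np.array([-1.0, 0.5, 2.0]))
```

The loss is zero when the student matches the teacher's distribution and grows as they diverge, which is the gap PROD's progressive schedule aims to keep manageable.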
Mutual gaze, i.e., people looking at each other, is ubiquitous in our daily interactions, and detecting mutual gaze is of great significance for understanding human social scenes. Current mutual gaze detection methods focus on two-stage approaches, whose inference speed is limited by the two-stage pipeline and in which the performance of the second stage is affected by the first. In this paper, we propose a novel one-stage mutual gaze detection framework called the Mutual Gaze TRansformer, or MGTR, to perform mutual gaze detection in an end-to-end manner. By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer mutual gaze relationships based on global image information, which streamlines the whole process. Experimental results on two mutual gaze datasets show that our method is able to accelerate mutual gaze detection without losing performance. Ablation studies show that different components of MGTR can capture different levels of semantic information in images. Code is available at https://github.com/gmbition/mgtr
Nowadays, Transformer-based models are gradually becoming the default choice for artificial intelligence pioneers, and these models show superiority even in few-shot scenarios. In this paper, we revisit classical methods and propose a new few-shot alternative. Specifically, we investigate the few-shot one-class problem, which in fact takes a known sample as a reference to detect whether an unknown instance belongs to the same class. This problem can be studied from the perspective of sequence matching. The results show that with meta-learning, the classical sequence matching method, i.e., Compare-Aggregate, significantly outperforms Transformers, and the classical methods require much less training cost. Furthermore, we conduct an empirical comparison between the two kinds of sequence matching approaches under simple fine-tuning and meta-learning. Meta-learning causes the features of Transformer models to have high correlation dimensions, and the reason is closely related to the number of layers and heads in the Transformer models. The experimental code and data are available at https://github.com/hmt2014/fewone
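The Compare-Aggregate idea can be sketched in a minimal form: compare each query token against the reference tokens, then aggregate the per-token matches into a single same-class score (a toy illustration with cosine similarity and mean pooling, not the paper's exact model):

```python
import numpy as np

def compare_aggregate(reference, query):
    """Minimal Compare-Aggregate one-class score: compare each query token
    with its best-matching reference token (cosine similarity), then
    aggregate by averaging. Inputs are (n, d) token-embedding arrays."""
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    qry = query / np.linalg.norm(query, axis=1, keepdims=True)
    sim = qry @ ref.T                     # (n_query, n_ref) cosine sims
    return float(sim.max(axis=1).mean())  # aggregate per-token best matches

ref = np.array([[1.0, 0.0], [0.0, 1.0]])
same = compare_aggregate(ref, np.array([[2.0, 0.0]]))    # aligned query
diff = compare_aggregate(ref, np.array([[-1.0, -1.0]]))  # opposing query
```

A query drawn from the same class as the reference scores near 1, while an unrelated query scores low, which is exactly the one-class decision the paper studies.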
Ammunition scrap inspection is an essential step in the process of recycling ammunition metal scrap. Most ammunition is composed of several components, including the case, primer, powder, and projectile. Ammunition scrap containing energetics is considered potentially dangerous and should be separated before the recycling process. Manually inspecting each piece of scrap is tedious and time-consuming. We have collected a dataset of ammunition components with the goal of applying artificial intelligence to automatically classify safe and unsafe scrap pieces. First, two training datasets are manually created from visual and X-ray images of ammunition. Second, the X-ray dataset is augmented using the spatial transforms of histogram equalization, averaging, sharpening, power-law, and Gaussian blurring to compensate for the lack of sufficient training data. Finally, the representative YOLOv4 object detection method is applied to detect the ammunition components and classify scrap pieces into safe and unsafe classes. The trained models are tested against unseen data to evaluate the performance of the applied method. The experiments demonstrate the feasibility of ammunition component detection and classification using deep learning. The dataset and pre-trained models are available at https://github.com/hadi-ghnd/scrap-classification.
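One of the augmentations mentioned, histogram equalization, can be sketched directly in NumPy (a generic implementation of the standard technique, not the paper's exact preprocessing code):

```python
import numpy as np

def histogram_equalize(img):
    """Histogram-equalize an 8-bit grayscale image: map intensities
    through the normalized cumulative histogram so a narrow intensity
    range is stretched across the full [0, 255] scale."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]             # first nonzero cumulative count
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[img]

# low-contrast image occupying only the intensity range [100, 103]
img = np.array([[100, 100, 101, 101],
                [102, 102, 103, 103]], dtype=np.uint8)
eq = histogram_equalize(img)              # spans the full 0-255 range
```

This kind of contrast stretching is useful for X-ray imagery, where raw intensities often cluster in a narrow band.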
Realistic generative face video synthesis has long been a pursuit in both the computer vision and graphics communities. However, existing face video generation methods tend to produce low-quality frames with drifting facial identity and unnatural motion. To tackle these challenges, we propose a principled framework named StyleFaceV, which produces high-fidelity, identity-preserving face videos with vivid motion. Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable and dynamic results. Specifically, StyleGAN3 provides strong priors for high-fidelity face image generation, but its latent space is intrinsically entangled. By carefully examining its latent properties, we propose decomposition and recomposition designs that allow for the disentangled combination of facial appearance and motion. Moreover, a temporal-dependent model is built upon the decomposed latent features, and it samples reasonable motion sequences capable of generating realistic and temporally coherent face videos. In particular, our pipeline is trained with a joint training strategy on both static images and high-quality video data, which enjoys higher data efficiency. Extensive experiments demonstrate that our framework achieves state-of-the-art face video generation both qualitatively and quantitatively. Notably, StyleFaceV is capable of generating realistic $1024\times1024$ face videos even without high-resolution training videos.